Introduction
Question: why does an ϵ-greedy agent keep walking into the cliff?
Off-policy: the policy used during learning is different from the policy being estimated (a non-optimal behavior policy is followed while the optimal policy is estimated); in short, learn one policy, act with another.
grid world with cliff:
| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 5 | 6 | 7 | 8 | 9 |
| 10 | 11 | 12 | 13 | 14 |
| s | cliff | cliff | cliff | goal |
The Q-Learning algorithm
ϵ-greedy:
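A minimal sketch of ϵ-greedy action selection over a tabular Q function (the function name, Q-table shape, and indexing below are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[state]))           # exploit
```

With a small ϵ the agent mostly acts greedily but still takes random actions now and then, which is exactly why such a behavior policy occasionally steps off the cliff edge even after the action values have converged.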
Reinforcement learning algorithms:
TD learning:
V(St) ← V(St) + α[Rt+1 + γV(St+1) - V(St)]
On-policy TD control (SARSA):
Q(St, At) ← Q(St, At) + α[Rt+1 + γQ(St+1, At+1) - Q(St, At)]
Off-policy TD control (Q-learning):
Q(St, At) ← Q(St, At) + α[Rt+1 + γ max_a Q(St+1, a) - Q(St, At)]
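As a sketch, the two control updates differ only in the bootstrap target (the function names and the (state, action)-indexed Q array below are assumptions for illustration):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action the behavior policy actually takes next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the greedy action, whatever the behavior policy does."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The only difference is the target: SARSA uses the action actually taken next (on-policy), while Q-learning uses the greedy maximum (off-policy), which is why Q-learning can estimate the optimal policy while behaving ϵ-greedily.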
Code implementation
```python
# environment: a grid of size m*n with a goal, cliff cells, and a start point (bottom-left corner)
import numpy as np
# Environment setup: Sutton's book, example 6.6 (cliff walk)
```
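A minimal sketch of the reward grid, assuming the 4×12 layout of example 6.6 with the start at the bottom-left corner, the goal at the bottom-right, and the cliff along the bottom row in between (the variable names here are illustrative):

```python
import numpy as np

n_rows, n_cols = 4, 12
rewards = np.zeros((n_rows, n_cols))
rewards[-1, 1:-1] = -100.0    # cliff cells along the bottom row
rewards[-1, -1] = 1.0         # goal cell in the bottom-right corner
print(rewards)
```

Printing `rewards` gives: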
```
[[   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.]
 [   0. -100. -100. -100. -100. -100. -100. -100. -100. -100. -100.    1.]]
```
Learning setup:
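A minimal sketch of the learning setup; the specific hyperparameter values and the Q-table shape below are assumptions, not the original settings:

```python
# learning setup (assumed values)
import numpy as np

n_rows, n_cols = 4, 12        # cliff-walk grid from the environment above
n_actions = 4                 # up, down, left, right
alpha = 0.5                   # step size
gamma = 1.0                   # discount factor (episodic task)
epsilon = 0.1                 # exploration rate for epsilon-greedy
n_episodes = 200              # the run below reports up to episode 199
Q = np.zeros((n_rows * n_cols, n_actions))   # one row of action values per grid cell
```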
Start learning:
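Continuing with the names from the setup sketch above, a possible training loop; the state encoding, border handling, and the reward convention of -1 per step and -100 for falling off the cliff (as in Sutton's example 6.6) are assumptions here:

```python
# start learning: tabular Q-learning with an epsilon-greedy behavior policy
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]            # up, down, left, right
start, goal = (n_rows - 1, 0), (n_rows - 1, n_cols - 1)

def to_state(cell):
    return cell[0] * n_cols + cell[1]

for episode in range(n_episodes):
    cell, total_r = start, 0
    while cell != goal:
        s = to_state(cell)
        # epsilon-greedy behavior policy
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        # move, clipping at the grid borders
        r_ = min(max(cell[0] + moves[a][0], 0), n_rows - 1)
        c_ = min(max(cell[1] + moves[a][1], 0), n_cols - 1)
        if r_ == n_rows - 1 and 0 < c_ < n_cols - 1:   # stepped into the cliff
            reward, next_cell = -100, start
        else:
            reward, next_cell = -1, (r_, c_)
        total_r += reward
        # off-policy Q-learning update: bootstrap from the greedy next action
        Q[s, a] += alpha * (reward + gamma * np.max(Q[to_state(next_cell)]) - Q[s, a])
        cell = next_cell
    print(f"Episode {episode}: R = {total_r}")
```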
At episode 199 the episode return R is -13, which is the return of the optimal 13-step path along the cliff in Sutton's example 6.6.